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Abstract — We propose the notion of sample distortion function 
for i.i.d compressive distributions with the aim to fundamentally 
quantify the achievable reconstruction performance of com- 
pressed sensing for certain encoder-decoder pairs at a given 
undersampling ratio. The theoretical SD function is derived 
for the Gaussian encoder and Bayesian optimal approximate 
message passing decoder thanks to the rigorous analysis of the 
AMP algorithm. We also show the convexity of the general 
SD function and derive two lower bounds. We then apply 
the SD framework to analyse compressed sensing for natural 
images using a multi-resolution statistical image model with both 
generalized Gaussian distribution and the two-state Gaussian 
mixture distribution. For this scenario we are able to achieve 
an optimal bandwise sample allocation and the corresponding 
SD function for natural images to accurately predict the possible 
compressed sensing performance gains. We further adopt Som 
and Schniter's turbo message passing approach to integrate the 
bandwise sampling with the exploitation of the hidden Markov 
tree structure of wavelet coefficients. Natural image simulation 
confirms the theoretical improvements and the effectiveness of 
bandwise sampling. 

Index Terms — Sample distortion function, bandwise sampling, 
sample allocation, turbo decoding 



I. Introduction 

TRADITIONALLY in compressed sensing a lot of work 
has been done in developing reconstruction algorithms 
assuming the optimality of the homogeneous random sensing 
matrix. There has recently been more attention on tailoring 
the sensing matrix in accordance with the signal of interest. 
We focus on designing a block diagonal measurement matrix 
for wavelet representation of natural images which falls under 
the general scope of bandwise sampling. 

Donoho pioneered the use of bandwise sampling for com- 
pressed sensing in his original paper [1]. Tsaig further ex- 
panded the idea through the concept of two-gender CS, which 
randomly samples the fine- scale wavelet coefficients while 
fully samples in the coarse-scale domain Q. Relatively better 
reconstruction quality is empirically shown over the homoge- 
neous Gaussian sensing matrix while both of them failed to 
provide a systematic sample allocation method. Optimizing 
the bandwise sample allocation of the sensing matrix was 
originally considered in [3| with the aim of minimizing the 
reconstruction uncertainty in terms of the entropy of the CS 
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approximation. However, quantifying the uncertainty is not 
easy and the authors are obliged to resort to an ad hoc solution. 

In fact, the notion of optimized bandwise sampling dates 
back much further and was instrumental in Kashin's proof 
of the optimal rates of approximation (n-widths) for certain 
classes of smooth function [4], which was a key inspiration for 
the theory of compressed sensing JT|. Specifically, bandwise 
sampling forms the basis of Maiorov's discretization theorem 
which relates function n-widths to the n-widths of a sequence 
of finite dimensional £ p balls @. 

In other recent work Krzakala et al. [6| explored a different 
aspect of selecting a good measurement matrix. As one essen- 
tial component, the block diagonal spatially-coupled sensing 
matrix was used to reach the fundamental undersampling limit 
of compressed sensing with almost perfect reconstruction J6|, 
|7|. However, to achieve the ground-breaking improvement, a 
first order phase transition must be present. Simply put, the 
insignificant components of the compressive signal must be 
extremely small (10 -6 of the large components in our signal 
model). We do not observe this level of compressibility in 
natural images and so this approach is not practical for the 
compressed sensing of real images. 

Main Contributions 

In this paper we begin by proposing the sample distortion 
(SD) function with the aim of assessing the performance of 
different encoding and decoding methods quantitatively in 
terms of the expected mean squared error (MSE) distortion. 
Within the stochastic compressive sensing setting, we employ 
two compressive statistical models: the generalized Gaussian 
distribution (GGD) and the two-state Gaussian mixture distri- 
bution (GMD). Both of the models have been used for imaging 
previously e.g. (U, 0, HO), lUTTl . As a broad definition, the 
SD function is applicable to any encoder-decoder pair, e.g. the 
Gaussian homogeneous encoder with the linear £2 decoder 
or the £\ minimum CS decoder. We mainly investigate the 
SD function for the Bayesian optimal approximate message 
passing (BAMP) decoder (HI, (13). The AMP algorithm 
admits a rigorous analysis in the large system limit and 
naturally provides the theoretical basis for the SD function of 
CS decoders lfT4l . [|T5l when used together with the Gaussian 
random encoder. 

In order to understand the SD function better we next 
derive a simple SD lower bound on the achievable MSE 
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performance of any linear encoder-CS decoder pair in terms 
of the differential entropy of the source. A generally tighter 
model based bound is further derived specifically for both 
GGD and GMD by leveraging the lower bound for a Gaussian 
source. Finally we show that the SD function is necessarily 
convex. This allows us to introduce a hybrid convexified 
CS encoder-decoder system. Specifically, we show that by 
randomly throwing away components of the source, a better 
SD performance can be achieved as the convex combination 
of the CS decoder and the trivial decoder x = in the low 
sample rate regime. 

The second part of the paper makes a contribution to the un- 
derstanding of optimizing the bandwise sampling in the multi- 
resolution statistical image model regime following the idea 
from 0. Built upon the bandwise independent wavelet model, 
we are able to define the SD function for each wavelet band, 
within which the wavelet coefficients are modelled as i.i.d 
random variables. By virtue of the convexified SD function 
for Gaussian encoder-CS decoder pair, the optimal sample 
arrangement is achieved by performing a greedy sample al- 
location. Simply put, we allocate samples progressively to the 
wavelet band with the largest distortion reduction, thus defin- 
ing the SD function for natural images. We then pursue further 
that direction to incorporate the wavelet dependencies with the 
bandwise sampling. The wavelet coefficients of natural images 
are known to have the persistence across scale (PSA) property 
fT6l which can be modelled using the hidden Markov tree 
(HMT) structure [8 |. We leverage Som and Schniter's recently 
proposed turbo decoding approach to alternate between the 
CS decoding and the tree structure decoding [17|. Instead of 
using a uniform distribution of samples across wavelet bands, 
we choose the optimized block diagonal sensing matrix to 
sample independently in the CS decoding procedure. We see 
that together with the exploitation of the wavelet tree structure 
via the turbo scheme, the accurate message acquired from 
the coarse scale bands propagates to the fine scale bands 
and eventually benefit the reconstruction. Theoretically the 
sample allocation obtained from the bandwise independent 
wavelet model is suboptimal. However, actual simulation with 
natural images indicates that it is a reasonable choice since we 
show that the empirically best sample allocation for the turbo 
decoding method is not substantially better. 

The remainder of the paper is organized as follows. We set 
up the sample distortion frame work in Section |TT] where the 
SD function for two types of compressive distribution is for- 
mulated. The convexified CS decoder is introduced based on 
the convex property of the SD function. In Section UTT] the SD 
function is expanded to the multi-resolution image model. The 
sample allocation for bandwise sampling is proposed for both 
the bandwise independent assumption and the hidden Markov 
tree structure. Comparison is made between the achievable 
distortion rates in the statistical CS setting and those known 
in the deterministic setting for Besov spaces. A turbo decoding 
approach is employed for the collaboration between bandwise 
sampling and the exploitation of the bandwise dependencies. 
Section [TV] presents the numerical results to compare different 
encoder-decoder pairs. We finish the paper with the some 
discussions and conclusion in Section [Vj 
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Fig. 1. SD functions for GMD data p(x) = 0.38 7V(0, 1.198) 
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II. Sample Distortion Framework 

A. SD function definition 

Suppose the signal of interest x G W 1 is a random 
vector (source) with i.i.d components drawn according to 
the prior distribution p(x). Denote <l> G ]R mxn , m < n 
as the measurement matrix or the encoder, and y = <l>x 
as the underdetermined linear combination of the source. 
Let S = m/n be the undersampling ratio. The goal of 
statistical compressed sensing is to reconstruct x using some 
Lipschitz regular mapping A : R m — » R n based on the 
knowledge of y, and p(x). In our work, we are interested 
in the reconstruction quality for certain encoder-decoder pairs 
(<I>, A) at an undersampling ratio S, which is evaluated by the 
expected error distortion between the original signal x and the 
estimation A(<£x): 



^{*,A}(^) = -E||x- A(**x) 

n 



(1) 



Along the lines of the classical rate-distortion function in 
the communication field |[T8lL we define a sample distortion 
function for the compressed sensing setting. 

Definition 1: The Sample Distortion (SD) function is de- 
fined as the infimum of undersampling ratios for which there is 
an encoder-decoder pair, A), that can achieve an expected 
distortion D. 



D(5) 



inf £> { #,a}(<5) 



(2) 



Through a slight abuse of terminology, we will also use 
the term SD function to refer to the minimum distortion 
level a specific encoder-decoder pair can achieve at a fixed 
undersampling ratio for a given compressive source. In this 
paper we will concentrate on the Bayes optimal AMP (BAMP) 
decoder. We will consider two specific non-Gaussian distri- 
butions, the two-state Gaussian Mixture distribution (GMD) 
and the generalized Gaussian distribution (GDD), to model 
the compressive random vector. 
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As the combination of two Gaussian distributions with large 
and small variance respectively, the GMD model is quite 
effective at capturing the heavy tailed nature of an approximate 
sparse signal by adjusting the activity rate A. 



Pgmd 0) =P(X, 5 = 1)+ p(x, S = 0) 

=p(s = l)M(x;0,a 2 L ) 

+ p(s = 0)Ar(x;0,a 2 s ) 
=XJV(x; 0, a 2 L ) + (1 - X)JV(x; 0, a\ ) 



(3) 



A random vector with i.i.d GMD components can be seen as 
generated from either the small variance Gaussian distribution 
or from the large one, depending on the hidden states s = 
{0, 1}. Since coefficients with small magnitude are expected 
to dominate the signal domain for compressive signals, it is 
reasonable to assume A < 0.5. 

Another popular probabilistic model for compressive data 
is the generalized Gaussian distribution (GGD). The pdf for 
the GGD can be written as 



Pggd ( x ) 



2vWi) 



exp 



(4) 



where /3 = T(l/a)/T(3/a), a is the standard deviation and a 
is the shape parameter. As a goes to zero the distribution has 
increasingly heavy tails. For images we are typically interested 
in the GGD with a ~ [0.3, 1] since these distributions provide 
a good approximation for the distribution of the wavelet 
coefficients in a given band for natural images. 

B. SD function for BAMP 

Recent work by Donoho, Maleki and Montannari has 
shown that the AMP algorithm can achieve the same spar- 
sity undersampling trade-off as the corresponding i\ convex 
optimization procedure, but at less computational cost [H2ll . 
Furthermore, when the signal prior, p(x), is assumed to be 
known, the generic AMP algorithm can be tuned optimally 
by replacing the soft thresholding step with an optimal scalar 
MMSE estimator [[31, [19|. Moreover, the asymptotic MSE 
behaviour of AMP can be accurately predicted by a state 
evolution (SE) formalism 1(151 , called the cavity method in the 
context of statistical physics [7], thus providing a theoretical 
basis for the SD function for Bayes optimal CS when using a 
Gaussian encoder. For BAMP decoder, the distortion iteration 
can be derived from the SE function [Q2D, 0- 



D fc+1 =E{[F(x+^z;^) 



^2 



] 2 } 



(5) 



where x follows the choice of the compressive distribution, 
z ~ A/"(0, 1) is independent of x, and Do = E(x 2 ). The 
function F(; ) is the (non-linear) scalar MMSE optimal esti- 
mator for x given x + z, which has a close-form expression 
for GMD 0, ifTTl and can be solved numerically for GDD. 
The expectation in © is taken with respect to x and z and is 
in general calculated numerically. The SD function for BAMP 
decoder D BAMP (S) is then given by the fixed point 0of ©. 

^or the distributions considered in this paper there is only one fixed point, 
i.e. BAMP exhibits no phase transitions 
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Fig. 2. SD functions for GGD data a = 0.4, a = 1 and lower bounds 



Examples of the theoretical prediction for the SD function 
of GMD and GGD data using BAMP decoder can be found 
in Fig. [T] and Fig. [2] respectively. 

C. SD Lower bound 

To understand the fundamental theoretical limits of CS 
for compressible distributions, we now derive some bounds 
for the SD function. We first prove an entropy based bound 
(EBB) which is a sampling analogy to the classical Shannon 
Rate Distortion Lower Bound (this result first appeared in the 
conference paper [20]). 

Theorem 1: Let x £ R n be a realization of the random 
vector x = x±,--- ,x n , i.i.d. ~ p(x), Var(^) = 1 and 
h(xi) < oo. Let y = <I>x, y e M 5n , S = m/n < 1. Then 
for any Lipschitz reconstruction decoder A : R m — >> R n , we 
have: 

D±(S) > (1 - £)2 2 (M*)-^)/(i-5) (6) 

where h g = | log 2 2ne is the entropy of a unit variance 
Gaussian random variable. 

The proof is given in appendix lAl 

The EBB can be easily rescaled to bound the SD per- 
formance for distributions with non-unit variance. When the 
source x is Gaussian the second term in the lower bound 
becomes 1. The EBB for a Gaussian distribution reduces to the 
well known form: D EBB = 1—5 which can be shown to be tight. 
Furthermore when we use the (optimal for Gaussian source) 
linear estimator, x = $ ^y, ^ j s straight forward to show that 
the SD function, D^ 2 {8), has the same form independent of 
the pdf of the source x and is achievable with any full rank 
linear encoder. 

While the EBB in Theorem [T] provides a bound on the 
achievable performance of CS specifically for i.i.d sources, 
it is not clear how close we can expect to get to it. The EBB 
for both GMD and GGD data are plotted in Fig. Q] and Fig. 
[2 We can see that at low undersampling ratios, it is not tight 
since we expect the SD function, i.e. the MSE, to approach 
the signal energy as 5^0. 
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To see this we can define a model based bound (MBB) 
to compensate for the disadvantage of the EBB. Inspired by 
the fact that the EBB is tight and achievable for Gaussian 
source, we resort to the hierarchical Bayesian model to approx- 
imate the target compressible distributions. By introducing the 
variance as a latent variable, the hierarchical representation 
of a compressive distribution p(x) can be understood as the 
weighted sum of (possibly infinite) Gaussian distributions. 



p(x) 



p(x\r)p(r) dr 



M[x\ 0, r)p(r) dr 



(7) 



where p{r) is the weight for the Gaussian component 
A/"(x;0,r). The MBB is then derived in the following man- 
ner: assume we partition the source x into different groups 
according to the variance. For both encoder and decoder, we 
agree to transmit and reconstruct the source group by group 
in the descendant order of the variance. For each Gaussian 
group, its SD performance is tightly bounded by the EBB. 
Then the lower bound for the whole procedure can be seen as 
consisting of weighted combination of the EBB of Gaussian 
components. Thus the MBB has the form: 



A-bb(*) 



/ rp(r) 
Jo 



dr 



(8) 



with S — J c °°p(t) dr. 

For the GMD model, the MBB is intrinsically a discretized 
hierarchical Bayesian model with only two Gaussian compo- 
nents. Thus its EBB can be seen as the discretized version of 
the general form: 



(1 - \)a% + (A - S)a\ < 8 < A 



{l-8)a% 



A < <5 < 1 



(9) 



For the GGD model, the detailed procedure for inferring its 
hierarchical Bayesian prior p(r) is relegated to appendix [B] As 
we can see in both Fig. [T] and Fig. [2 the MBB is much tighter 
than the EBB for small undersampling ratios, although neither 
the MBB nor the EBB dominates for the whole range of 
the undersampling ratios. The supremum of the two therefore 
yields a better lower bound for the SD function. 

D. Convexity property 

In this subsection, we further exploit the similarities with 
the rate distortion theory to show that the SD function D(S) is 
necessarily convex. We then show a direct application of this 
to improve the reconstruction quality of the Gaussian encoder- 
BAMP decoder pair in the low sample-rate regime. 

Theorem 2: The SD function D(S) is convex. 

Proof: Consider two achievable SD points (Si, D(Si)) 
and (£2, D(S 2 )). To prove the SD function is convex, we only 
need to show the convex combination of the two points is 
also achievable. Let S t = tSi + (1 — t)S2, < t < 1. To 
sample the source x £ R n at the undersampling ratio S t , we 
could split x into two parts x = [xi,x 2 ] T , where xi £ R tn , 
x 2 £ R( 1-t ) n , and apply encoders with undersampling ratio 
Si, S2 to xi, X2 respectively. Then the reconstruction of xi 



fo Wti I H 1 1 1 1 F 



Xl 



x 2 



A 



- xi 



Fig. 3. Hybrid zeroing Gaussian matrix as the convex combination of a 
trivial decoder x = and a BAMP decoder A. Elements equal to are 
represented with white blocks. 



and X2 has achievable MSE: tnD(Si) and (1 — t)nD(S 2 ). So 
the MSE of the reconstruction of X is: 

nD(S t ) < tnD(Si) + (1 - t)nD(S 2 ) (10) 

Therefore 

D(tSi + (1 - t)S 2 ) < tD(Si) + (1 - t)D(S 2 ) (11) 



The convexity property can be applied to the SD function 
for any specific encoder-decoder pair. A direct consequence 
of Theorem [2] is that for a given encoder-decoder pair with 
a concave SD function between Si and S 2 (Si < S 2 ), there 
exists a hybrid system with better SD performance: it can 
be easily achieved by applying the two encoder-decoders to 
different portions of the source to get the convex combination 
of D(Si) and D(S 2 ). A special case is when Si = with 
the corresponding trivial decoder (x = 0) and S 2 = S c 
with S c being the crucial undersampling ratio. In this case, 
instead of sampling the source x with a full Gaussian matrix, 
$ £ R Snxn , we split x as before with xi £ R tn and 
x 2 £ R( 1- ^ n , t = S/S c . We then sample xi with the Gaussian 
matrix, <1> £ R 5nxtn and reconstruct, while the remaining X2 
we reconstruct as zero. Since this is equivalent to setting part 
of the encoder to zero, $ = [3>,0], we call this the zeroing 
procedure, as illustrated in Fig. [3] 

Close observation of the SD functions for the Gaussian 
encoder-BAMP decoder system in Fig. [T] and Fig. [2] reveals 
that the curves are usually convex for high sampling ratios 
but concave for low sampling ratios. By applying the hybrid 
zeroing Gaussian matrix, we achieve a better SD performance 
for S below the crucial undersampling ratio S c . 

The Gaussian sensing matrix has been widely assumed 
within the CS community to be optimal in terms of CS 
performance. Indeed this has been proved to be the case for 
the distributions that exhibit exact sparsity [21] . However, 
under the assumption that the BAMP achieves the Bayes 
optimal reconstruction - this would follow, for example, if the 
replica method could be proved to be rigorous [7 1 - then the 
convexified procedure resulting from Theorem ^indicates this 
assumption to be false. It also shows that the optimal encoder 
is related to both the signal property and the corresponding 
decoding method. 
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III. Sample Distortion Function for Statistical 
Image Model 

In this section we leverage the SD functions denned above 
to study the SD behaviour of the compressive imaging. This 
enables us to investigate optimal bandwise sampling strategies 
with a fixed sample budget, in a similar manner to 0, 
but in terms of the expected distortion of the CS decoder. 
We begin by introducing the bandwise independent multi- 
resolution statistical model for natural images. 

Natural images are transform compressible: they have more 
concise representation in the wavelet domain. The wavelet 
decomposition of an image /(X), X G [0, l] 2 has the form 
EE): 

k j ; > i, k 

where fa^ are the scaling functions, ipj^ are the prototype 
bandpass functions such that together they form an orthonor- 
mal basis. The variables /i^ are in turn the scaling coefficients 
at scale i and Uj^ are the wavelet coefficients at scale j. We 
can group the coefficients into a single vector according to 
the scale or band 6 = [/x ,0;$, ^+1, • ■ • ] T and assign each a 
band index. For simplicity fi. is band 0, the coarsest wavelet 
coefficients group, w i9 is denoted as band 1, and the rest can 
be labelled in the same manner. Here we follow fl22l . [|23l 
and consider a simple statistical model defined directly on the 
wavelet coefficients. The band is always treated as Gaussian 
with variance cr , since these coefficients typically exhibit no 
sparsity. This can be seen as a worse case assumption in terms 
of its SD function. For the other bands, we model the wavelet 
coefficients within each band as mutually independent and 
impose a compressive distribution for each wavelet band. To 
be specific, ojj^ at scale j can be modelled as 



• GGD(0, (J 2 j,OLj) 



or 



(13) 



(14) 



where typically for natural images the distributions exhibit a 
self-similar structure with an exponential decay across scale, 
i.e. a 2 = 2-^cr 2 for the GGD and a 2 a j = a = S,L 

for the two- state GMD for some f3 > 0. For the bandwise 
independent image model, we assume an uniform activity rate 
Xj for each wavelet band in spite of the coefficient index. In 
particular, we define Xj := Pr{sj^ = 1}. 

A. Bandwise Sampling 

To keep things tractable we restrict ourselves to the class of 
linear encoders, y = <3>0, that are block diagonal and sample 
the different wavelet bands separately with the following form: 



V 



*1 



\ 



(15) 



where <$>i G M miXn %mi < rii puts rrii measurements to 
sample the zth band. The equality holds when the zth band 



is fully sampled with 3^ being an identity matrix. Otherwise 
<$>i is a possibly zero padded (for convexity) Gaussian random 
matrix. And y . = is the CS observation for each block. 

To derive the SD function for the multi-resolution images, we 
first consider the L wavelet bands as independent and parallel. 
The question then is how to allocate a fixed number of samples 
to the various bands, with the aim of minimizing the total 
reconstruction distortion. Let us assume for now that rrii, rii be 
continuous and Si = rrii /rii G [0,1]. The problem is reduced 
to the following optimization 



min ^ cr1niDi(mi/ni) 



i=l 
L 



(16) 



s.t. m- h — m and < mi < n^, i = 1, . . . , L. 



where Di is the (convex) SD function for band i normalized to 
have unit variance. Using Lagrange multipliers, we construct 
the objective function 

L = — ^2a 2 niDi(mi/rii) 

i 

— A(y^ rrii — m) 

i 

-y^fii(mi - rii) 

i 

Differentiating with respect to rrii and setting equal to we 
have 

dL 2 dDi dSi 

-crfrii^^ • A - fii + Vi = (18) 



(17) 



drrii 



or 



d5i drrii 
2 dDi A 

Define the distortion reduction function as 

noting that this function is non-increasing in terms of Si. Now 
applying the Kuhn-Tucker (KT) conditions we arrive at: 



(19) 



(20) 



with 



and 



rji(Si) - X - fii + ^ = 0, 
Lii(rii — rrii) = 0, iii > 0, 



Virrii = 0, vi > 0. 



(21) 
(22) 
(23) 



We therefore have three cases for the distortion reduction 
function. First, if < rrii < rii then iii = Vi = and the 
undersampling ratio, Si, is set so that rji(Si) = A. Next suppose 
that rrii = rii so that Si = 1. In this case, the KT conditions 
imply that 

rji(Si) > A, V(5, (24) 

In the final case we have rrii = and Si = 0. Here the KT 
conditions imply: 

rji(Si) < A, VS Z (25) 



6 



JOURNAL OF JAT E X CLASS FILES, VOL. 1, NO. 11, NOVEMBER 2002 



Convex BAMP 

Empirical BAMP with Oracle Activity Info 
MLB 




10 10 10 10 10 

Measurement Index for Band to Band 5 



Fig. 4. Distortion reduction function of 6 bands db2 wavelet decomposition 
of cameraman image using GMD model (including the low-pass band). The 
statistics is reported in Table |l] 



This gives us an optimal sample allocation strategy which 
is similar to the reverse water-filling idea in rate distortion 
theory [24] . We allocate samples to the band with the greatest 
distortion reduction value until another band has a greater one 
or that band has been fully sampled. The procedure is stopped 
when the total distortion reaches the desired level. 

To apply this idea to natural images we need to take account 
of the fact that rrii, rii and L are all discrete and finite. Thus 
we define a discretized distortion reduction (DR) function for 
each wavelet band. 



Vi( m i) ~ °i[Di(mi/ni) — Di((rrii + l)/n^)] 



(26) 



Suppose that rrii samples have been allocated to the ith band. 
The DR function gives the amount of distortion decreased by 
adding one more sample to that band. Then the number of 
samples allocated to the band i is 







mi s.t. r]i{rhi) = n 



if max?7i(mi) < n 
if mm rji(rrii) > n 
otherwise 



(27) 



where k is chosen so that rrii = m. With a convex SD 
function, the optimal allocation is again achieved by perform- 
ing a greedy sample allocation strategy. The DR function for a 
six band db2 decomposition based on a multi-resolution two- 
state GMD fitted to the wavelet coefficients of the cameraman 
image is illustrated in Fig. [4] One thing worth noting is that 
neither the convexity property nor the resulting greedy sample 
allocation method is restricted to the form of the decoder. 
For example the optimized bandwise sensing matrix can be 
designed in the same manner for the CS i\ decoder and the 
£2 decoder. 

B. Comparison to the Theory of Widths 

In l22ll . parallels are drawn between the statistical wavelet 
model we have considered here and the family of Besov 
function spaces. In particular, the authors argue that under 



appropriate conditions realizations drawn from the GMD or 
GGD based wavelet model almost surely lie in an associated 
Besov space. It is therefore interesting to explore the similar- 
ities and differences between the achievable distortion rates 
derived here and those known in the deterministic setting for 
Besov spaces. 

1 ) n-widths of Besov spaces : Consider the Lipschitz class 
of r-smooth functions on the interval [0, 1] and the unit ball, 



B^, defined as: 



{/:||/ (r) llp<l} 



(28) 



where denotes the rth derivative of / and the L p ball acts 
as the deterministic counterpart to the coefficient prior above. 

The £2 error of the best n-dimensional linear approximation 
for these spaces is known to scale as ~ n -r+i/p-i/2 f or ^ <= 
p < = 2 (25l Chapter 14, Theorem 1.1]. In contrast, the £2 error 
for the best CS reconstruction is characterized by the Gelfand 
width of which can be written as: 

d n (B r ) := inf sup{||ft|| 2 , h e n B r \. (29) 

& h 

and measures the uncertainty in B r v within the null space of <l>. 
Here, for 1 < p < 2 the best CS approximation error decays 
at the faster rate of ~ n _r , i.e. inversely proportional to the 
smoothness EH Chapter 14, Theorem 1.1]. This result was 
derived in Kashin's seminal paper [4], which is better known 
in the CS community for accurate bounds for the n-widths of 
l p balls in R n . 

2) Similarities and differences: Interestingly Kashin's re- 
sult relied on a discretization theory of Maiorov that uses 
a similar bandwise sampling to our own. Specifically Maiorov 
uses a subband decomposition of spline spaces to bound the n- 
width of B^ in terms of a weighted sum of finite dimensional 
n-widths for the individual subbands - effectively performing 
a bandwise sampling. Furthermore in both the deterministic 
and stochastic settings the allocation scheme is broadly the 
same: fully sample the first few low resolution subbands; then 
partially sample a number of intermediate subands; and finally 
set coefficients of all the higher resolution subbands to zero. 
However, in Kashin's theory, the number of partially sampled 
subbands grows as the distortion decreases and, indeed, it 
is this that accounts for the different rate of approximation 
compared with the best linear approximation. In contrast, 
in the sample allocation framework, the number of partially 
sampled bands, P, is bounded by the range of the distortion 
reduction function: 



p<0io g2 fa(o)Mi)). 



(30) 



For the two- state GMD model this bound is finite since from 
the MBB we can deduce that: 



ry(Q) 

7,(1) 



(7 



< 



L,0 



(31) 



5,0 



Note the same bound applies to the SD function for the 
MBB oracle decoder where the bandwise sampling is optimal. 
Hence, the fact that we do not get a growing number of 
partially sampled subbands implies that in the large system 
limit the CS approximation error will decay at the same 
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rate as for the best linear approximation. We can therefore 
conclude that the gains in CS solutions over optimal linear 
approximation for such a model are fundamentally limited. 
We can see this, for example, in Fig. [4] where we would only 
ever partially sample at most 3 subbands for the convexified 
BAMP decoder. 

C. Incorporating Tree Structure 

Until now we have developed a tractable sample allocation 
method for a multi-resolution image model by assuming the 
independence of the wavelet band. In this subsection we look 
beyond basic sparsity and incorporate the wavelet dependen- 
cies. We model the wavelet coefficients with the GMD and 
impose the hidden Markov tree (HMT) structure to the hidden 
states as in lU. To be specific, with the hidden states at the 
coarsest scale (band 1) being the "root", we connect each 
wavelet hidden state to the four "child" wavelet states below 
it to form the image quadtree (see Fig. [5]). The PSA property 
states that the coefficient is more likely to be small if its parent 
is small; similarly, it is likely to be large if its parent is large 
|[T6ll . In this case, the activity rate Xj yk for uj^ k depends on 
the activity rate of its parent on scale j — 1, Xj_i^ Pk , and the 
transition probabilities across scales. 

Xj, k =p(sj = l\sj-i = l)A j _i, Pfe 

+ p(s % = = 0)(1 - \j-i iPk ) (32) 

By incorporating the HMT decoding, we can provide a better 
estimation of Xj ik for each wavelet coefficient instead of using 
an identical Xj over the coefficient index k, thus improving 
the reconstruction quality. 

We first review the core principles of Som and 
Schniter's Turbo AMP decoding method ifTTl . Let u> = 
fe^i^2> ' ' ' i^-l] T denote the collection of the wavelet coef- 
ficients of different bands and s = [s l5 s 2 , • • • , s L ] T the corre- 
sponding hidden states vector. Assume y = [y , y , • • • , y L ] T 
is the CS observation vector using the block diagonal sensing 
matrix. In the Bayesian compressed sensing setting, the re- 
construction of u> from y is interpreted as approximating the 
posterior mean of the density p(u>\y): 

L 

P (u>\y) = J2Y[p(^-j\y y s )p( s ) 

s j=0 

L ^ rij rrij 

= ^ — ^p( s )]Jp( u j^j,k)]Jp(yj y t\^.j) 

j=o i s k=i t=i 

(33) 

where Zj ensures the normalization fp(uij\y.) = 1. The 
factor graph plotted in Fig. [5] visualizes this gfobal function 
ll26lh (27). Exact computation of p(w\y) is NP hard due to 
the dense and loopy structure of the factor graph. Instead we 
split the factor graph along the dashed line into two subgraphs 
as in IfTTl and calculate the marginal posterior p(ujj^ k \y.). 
The essence of turbo decoding is to exchange the local belief 
of the hidden states sj jk between AMP decoding and HMT 
decoding alternately, by treating the "extrinsic" information 
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band 1 band 2 band 3 



Fig. 5. Factor graph for bandwise sampling with HMT decoding. The upper 
graph illustrates a quadtree structure of the wavelet hidden states. The lower 
graph is the bandwise independent random mixing. 

on Sj y k from one decoding procedure as prior for the other 
decoding procedure. Here, unlike [fTTl . the AMP decoder is 
bandwise independent due to the block diagonal form of <l>. 
The interaction across different wavelet bands only comes 
from the HMT decoding. For a fully sampled wavelet band 
(e.g. band 1), the extrinsic information Xj^ passing from the 
BAMP decoding to the HMT decoding as the prior can be 
obtained as IfTTl : 

^. = P(Uj,k\ s 3,k = 1) 

hk p(u jik \s jik = l)+p(u jik \s jik ) = 

A^K fe ;Q 7 ^ L ) (34) 
Af((J jt k] 0, a 2 j L ) + N{uj ik \ 0, (T 2 j S ) 

We denote Xj, k as the oracle activity rate for Uj yk . The accurate 
information of the activity rate of the parent hidden states 
can then propagate through the HMT decoding, improve the 
estimation of the activity rate of their children states (the 
partially sampled bands, e.g. band 2, band 3) and benefit the 
final reconstruction. 

It is obvious that the SD function for the bandwise in- 
dependent image model is not appropriate for the turbo 
decoding scenario since it does not take the HMT decoding 
into consideration. To see the impact of HMT decoding, we 
plotted an empirical SD curve for BAMP decoder with the 
oracle extrinsic information of the hidden states in Fig. Q] The 
empirical curve is generated from the Monte Carlo simulations 
with synthetic GMD data. To be specific, we use the Xj yk in 
(l34l) instead of Xj for the scalar MMSE estimator of each 
synthetic GMD component. Fig. [T] demonstrates that provid- 
ing the BAMP decoder with extra hidden states information 
dramatically improves the reconstruction quality with the SD 
function lying very close to the lower bound. On the same 
principle, we could establish the DR function with the oracle 
activity information for the multi-resolution image model, as 
shown in Fig. HI and the corresponding SD function. To clarify 
the terminology, we denote the sample allocation for bandwise 
independent wavelet model as SA and the one based on the 
HMT model as HSA. We should note here that neither SA 
nor HSA is optimal for turbo decoding. The problem with 
SA is that it tends to undersample the fine scale bands since 
they contain less energy than the coarse bands when treated 
independently. While HSA is served as the benchmark by 
assuming we have the oracle hidden state information for each 
wavelet coefficient. The optimal sample allocation for turbo 
decoding should combine the merits of both SA and HSA. 
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IV. Natural Image Example 

The last section illustrates that the multi-resolution SD func- 
tion for natural images can accurately predict the reconstruc- 
tion distortion with the proposed bandwise sample allocation 
scheme. It also demonstrates the effectiveness of the sample 
allocation in terms of improving the image reconstruction 
quality, in spite of the choice of decoder. All simulations are 
performed on the 256 x 256 grayscale "cameraman" image as 
an example. We consider the haar wavelet and db2 wavelet 
decomposition of "cameraman" to model the image statistics. 
Similar results are found in a wide range of other images. 

A. set up 

We begin by estimating the parameters for both GGD and 
GMD to set up the multi-resolution statistical model. Table 
U shows the GGD and GMD parameter estimation for the 6 
bands db2 wavelet coefficients of cameraman using moment 
matching [23 ] and EM algorithm |[28l respectively. 



TABLE I 

Statistics for db2 wavelet coefficients of cameraman 



As previously stated, the scaling coefficients are modelled 
as Gaussian with the shape parameter being 2 for the GGD 
and the activity rate for hidden states being 1 for the GMD. 
Variances for the wavelet bands exhibit an exponential en- 
ergy decay for both distribution models. Given the parameter 
estimation, we are able to generate the image SD function 
and the subsequent bandwise sample allocation using the 
aforementioned method. 

The performance of the proposed bandwise sampling matrix 
is compared with the homogeneous Gaussian sensing matrix 
and the two-gender sensing matrix. The former uniformly 
distributes samples across all bands while the latter fully sam- 
ples the scaling band and uniformly allocates the remaining 
samples to all the wavelet coefficient bands. To show the 
sample allocation method is not restricted to the form of the 
decoders, we consider three reconstruction options: the linear 
£2 decoder, the CS £\ decoder, and the BAMP decoder. The 
SPGL1 toolbox H is used to implement the £\ decoder. Its SD 
function can also be derived using the SE formalism [ 15 1. Both 
the £2 and the £\ decoder are considered for the GGD and the 
GMD model. Since there is no close form MMSE estimator 
for the GGD data, the BAMP decoder is only applied to the 
GMD model. The detailed algorithm can be found in ifTTl . ffl . 

Within the GMD regime, we further leverage the quadtree 
structure of the hidden states through the turbo scheme. Three 
different sample allocations are examined when incorporating 
with the HMT decoding: SA, HSA, and the empirically 
optimized sample allocation, or ESA. The ESA is obtained by 

2 http://www.cs.ubc . ca/labs/scl/ spgl 1 /index.html 
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Fig. 6. Sample allocation per band for db2 wavelet with the GMD model. SA: 
sample allocation based on the bandwise independent model. HSA: sample 
allocation with the oracle activity information. ESA: empirically achieved best 
sample allocation for turbo decoding. 



manually moving a minimum of 100 samples from other bands 
to the finest wavelet band based on the SA to balance to total 
distortion. For the turbo decoding, the extrinsic information 
Xj^k defined in (l34t is assigned as the activity prior for the 
HMT structure when band j is fully sampled. An identical 
activity prior Xj as reported in Table H is used for partially 
sampled bands. Other hyperparameters to initialize the HMT 
decoding are set in accordance with the recommendation 
in IfTTl . For various choices of sample allocations, we ran 
20 turbo iterations, within which 50 BAMP iterations are 
performed. 

For quantitative comparison, the signal to distortion ratio 
(SDR) is used for both theoretical performance prediction 
and the experiments with the image. We examined camera- 
man at 4 different undersampling ratios : 10%, 15. 26%, 25% 
and 30% associated with m = 6554, 10000, 16384, 19661 
noiseless measurements. The sample allocation for six bands 
db2 wavelet decomposition of cameraman with the GMD 
model is reported in Fig. [6] We see that the scaling band 
and the coarse wavelet band always have the priority over 
the fine wavelet bands in terms of allocating samples. The 
same sampling pattern is also observed for the haar wavelet 
decomposition. And the ESA is the balance between SA and 
HSA, as expected. 

B. results comparison 

The reconstruction performance comparison among differ- 
ent choice of encoder-decoder pairs is twofold. First, the 
wavelet bands are assumed as mutually independent. We 
model the db2 wavelet coefficients of the cameraman image 
with both GGD and GMD model. The SDR results for 
different decoders with and without sample allocation are 
shown in Fig. [7] and Fig. g] for the GGD and the GMD 
respectively. The theoretical distortion prediction and lower 
bound are plotted as lines. And the actual distortion obtained 
with the cameraman image are represented as dots. From 
the figures, we observed that the SD functions predict the 
expected distortion quite accurately for all three choices of the 
decoder with the corresponding optimized sample allocation. 
The advantage over the homogeneous matrix and the two- 
gender matrix can be easily seen: there is roughly 4db SDR 
gain over Uniform+Ll and 2 Gender+Ll at 10% and 15% 
Nyquist rate and 2db gain when S = 25%, 30% for both 
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Fig. 7. SDR comparison of different encoder-decoder pairs for cameraman Fig. 8. SDR comparison of different encoder-decoder pairs for cameraman 
db2 wavelet with bandwise independent GDD model db2 wavelet with bandwise independent GMD model 



statistical models. For GMD with the BAMP decoder, there is 
an average of 2db gain over the unstructured sensing matrix 
for the GMD model. The l\ decoder always performs better 
than the £2 decoder. And the BAMP decoder is the best among 
the three. The reason is that the t\ decoder makes use of the 
sparsity property. And the BAMP decoder is provided with the 
signal statistical information. Similar results are also observed 
for the haar wavelet decomposition. 

Secondly, the HMT structure of the wavelet bands are taken 
into consideration. The comparison for different sample allo- 
cation proposals are made within the turbo decoding regime. 
We use the GMD model and the BAMP decoder for both 
haar wavelet and db2 wavelet coefficients. It is the joint use 
of optimized bandwise sampling and the turbo approach that 
delivers by far the best SDR performance, as evident in Fig. 
and Fig. [TO) Again, sample allocation shows its importances 
when there is a tight budget of samples: even without the turbo 
decoding procedure, SA+BAMP is 2db better at S = 0.1 and 
ldb better at S = 0.15 than Uniform+TurboAMP. We observed 
that the ES A is only slightly better than the independent model 
based sample allocation. It means that we should guarantee 
the coarse- scale bands are fully samples when we are short of 
samples since their energy dominates the whole image. Even 
when we have the luxury of manipulating samples, the benefit 
from HMT decoding is limited because of the energy boundary 
of the fine-scale bands. 

The 256 x 256 cameraman image along with the recon- 
structed images by different encoder-decoder pairs are visu- 
alized in Fig. [TT] at the undersampling ratio S = 15%. It 
further confirms the improvement achieved by implementing 
the bandwise sampling strategy while the gain of adding the 
HMT ingredient is not visibly substantial. 

V. Conclusion 

The main contributions of this paper are the set-up of 
the sample distortion framework for statistical compressed 
sensing, the investigation of its property and its extension 
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Fig. 9. SDR comparison of different sample allocation with turbo decoding 
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Fig. 11. Reconstruction using 10000 (15%) samples of the 256 x 256 cameraman image with different encoder-decoder pairs. The GMD is used to model 
the db2 wavelet coefficients statistics. 



extension to multi-resolution statistical image models. We have 
derived a tractable sample allocation method for minimizing 
the reconstruction distortion and shown that it provides an 
accurate prediction of the achievable SD performance for 
natural images. Finally, inspired by the turbo decoding ap- 
proach of Som and Schinter, we have explored the use of tree 
structured sparsity within the optimized bandwise sampling 
framework. Various encoder-decoder combination examined 
with the cameraman image illustrates the merit of bandwise 
sampling, especially in the regime of very low undersampling 
ratios. Future work will explore the possibility of adaptive 
sample allocation to get closer to the SD lower bound. 



Appendix A 
Proof of Theorem[T] 

Without loss of generality we will assume that 3? is an 
orthogonal projection operator and we denote by ^ ± the or- 
thogonal projection onto the null space of <I>. We can then split 
the signal x into its observed and unobserved components: 
y = <E>x and z = $ x x. Since we directly observe y we need 
only consider the component of the decoder that estimates z, 

A (z) . R m _^ ^n-m m We can then estmiate x as; 

x = A(y) = 3> T y + [^f A^ (y) (35) 

We can further write the squared error distortion in terms of 
A (z) (y) as 

D = \j p(y) / p( z ly)ll z - ^ (z) (y)lli (36) 

Now consider the following decomposition of the differen- 



tial entropy h(s) of the vector x: 

fc(x) = h(y) + h(z\y) 

= h{y)+h{z-^\y)\y) 

< h(y)+h(z-A^(y)) 

< — log 2 2ne H — log 2 2iienD/(n — m) 

(37) 

where we have used the following observations 

• (line 2) The decoder is a deterministic function of y and 
therefore the differential entropy of h(z — A^(y)|y) = 
h(z\y). 

• (line 3) The conditional entropy is bounded by the 
marginal entropy: h(x.\y) < /i(x). 

• (line 4) The entropy of a random variable with a fixed 
covariance is bounded by the entropy of a Gaussian with 
the same covariance. Similarly the entropy of a random 
vector v £ R n_m under the constraint that E{v T v} = 
nD is bounded by the entropy of a Gaussian random 
vector with covariance ? nD ^ T . 

(n—m)l 

The principle here is that the optimal projection should 
maximize the entropy of the observed component h(y) while 
the decoder, A(y), should minimize the distortion possible. 
This is similar to the concept of information sensing proposed 
in 0. 

Substituting 5 = m/n into (f37l) gives: 

h[x) < log 2 2^e + X - log 2 2ne (38) 

where we have used the i.i.d assumption to write h(x) = 
nh(x). This can then be rearranged to give the EBB. 
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Appendix B 

Derivation of the Hierarchical Bayesian Model 
for the GGD 

Here, we derive the hierarchical Bayesian model to describe 
the GGD, which is then used to bound the MSE performance 
described in the main text in Sec. lII-CI We introduce two latent 
variables c\ and c 2 to simplify the expression of GGD: 



ci 



■ c 2 



(39) 



Then the pdf of GGD can be written as 

Pggd(^) = ciexp(— !^-) (40) 

Let p(x\r) = M{x\ 0, r). To establish the hierarchical model, 
we need to find the prior p(r) which satisfies: 



M[x\ 0, t)p(t) dr = ciexp(-^^) (41) 
o c 2 



Using the substitution g{r) 



-==p(r), m = :I — and t 



the question becomes solving g{r) subject to 



Jo 



exp( dr = ciexp(— tm/ 1 ) 



(42) 



let z = \ and G(z) = g(r)\ T= i, we further transform the 
problem to find G(z) subject to 
poo G(z) 

/ exp(— zm) — — dz = ciexp(— tvrfi ) (43) 
Jo z 

Applying the integral formula [|29l : if J °°e~ zt y(t) dt = f(z), 
then y(t) = C~ 1 (f(z)), we obtain 

G(z) 



ciC x exp( )] 



C2 



(44) 



where is the inverse Laplace transform. The inversion 

of Laplace transform in (l44t can be solved numerically [1301 . 
From here we obtain the MBB for the GGD data in Fig. [2] 
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